JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Evaluating Large Language Models (LLMs) in open-ended scenarios is
challenging because existing benchmarks and metrics cannot measure them
comprehensively. To address this problem, we propose to fine-tune LLMs as
scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in
open-ended benchmarks. We first propose a comprehensive, large-scale,
high-quality dataset containing task seeds, LLMs-generated answers, and
GPT-4-generated judgments for fine-tuning high-performance judges, as well as a
new benchmark for evaluating the judges. We train JudgeLM at different scales
from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its
capabilities and behaviors. We then analyze the key biases in fine-tuning an LLM
as a judge and categorize them as position bias, knowledge bias, and format bias.
To address these issues, JudgeLM introduces a bag of techniques including swap
augmentation, reference support, and reference drop, which clearly enhance the
judge's performance. JudgeLM obtains the state-of-the-art judge performance on
both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM
is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8
A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an
agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM
also demonstrates extended capabilities in being judges of the single answer,
multimodal models, multiple answers, and multi-turn chat.
Comment: 30 pages, 23 figure
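The abstract names swap augmentation as a remedy for position bias but does not spell it out. A minimal sketch of one plausible reading follows: each pairwise training sample is duplicated with the two answers, and their scores, exchanged, so the judge cannot rely on answer position. The `JudgeSample` layout and both function names are assumptions made for illustration, not JudgeLM's actual data format or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JudgeSample:
    """Hypothetical pairwise judging sample (field names are assumed)."""
    question: str
    answer_a: str
    answer_b: str
    score_a: float
    score_b: float


def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Return a copy with the two answers and their scores exchanged."""
    return JudgeSample(
        question=sample.question,
        answer_a=sample.answer_b,
        answer_b=sample.answer_a,
        score_a=sample.score_b,
        score_b=sample.score_a,
    )


def augment_dataset(samples: List[JudgeSample]) -> List[JudgeSample]:
    """Keep every original sample and append its swapped counterpart."""
    augmented = []
    for sample in samples:
        augmented.append(sample)
        augmented.append(swap_augment(sample))
    return augmented
```

Because every answer then appears in both positions with the same score, a judge that simply favors the first slot is penalized on half of the augmented data.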
Repulsion Loss: Detecting Pedestrians in a Crowd
Detecting individual pedestrians in a crowd remains a challenging problem
since the pedestrians often gather together and occlude each other in
real-world scenarios. In this paper, we first explore how a state-of-the-art
pedestrian detector is harmed by crowd occlusion via experimentation, providing
insights into the crowd occlusion problem. Then, we propose a novel bounding
box regression loss specifically designed for crowd scenes, termed repulsion
loss. This loss is driven by two motivations: the attraction by target, and the
repulsion by other surrounding objects. The repulsion term prevents the
proposal from shifting to surrounding objects thus leading to more crowd-robust
localization. Our detector trained by repulsion loss outperforms all the
state-of-the-art methods with a significant improvement in occlusion cases.
Comment: Accepted to IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) 201
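The two motivations in the abstract, attraction by the target and repulsion by surrounding objects, can be illustrated with a toy loss. This is only a sketch under simplifying assumptions: it uses plain IoU for both terms and a hypothetical `weight` parameter, and it is not the formulation from the paper, which the abstract does not give.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def attraction_repulsion_loss(pred, target, others, weight=0.5):
    """Toy loss: pull `pred` toward `target`, push it away from `others`.

    attraction: 1 - IoU with the designated target box.
    repulsion:  largest IoU with any non-target ground-truth box.
    `weight` (assumed) balances the two terms.
    """
    attraction = 1.0 - iou(pred, target)
    repulsion = max((iou(pred, other) for other in others), default=0.0)
    return attraction + weight * repulsion
```

With this shape, a proposal that drifts from its target onto a neighboring pedestrian is penalized twice: the attraction term grows and the repulsion term grows as well.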
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
In-context learning, as a new paradigm in NLP, allows the model to rapidly
adapt to various tasks with only a handful of prompts and examples. But in
computer vision, the difficulties for in-context learning lie in that tasks
vary significantly in the output representations, thus it is unclear how to
define the general-purpose task prompts that the vision model can understand
and transfer to out-of-domain tasks. In this work, we present Painter, a
generalist model which addresses these obstacles with an "image"-centric
solution, that is, to redefine the output of core vision tasks as images, and
specify task prompts as images as well. With this idea, our training process is
extremely simple: it performs standard masked image modeling on the stitch
of input and output image pairs. This makes the model capable of performing
tasks conditioned on visible image patches. Thus, during inference, we can
adopt a pair of input and output images from the same task as the input
condition, to indicate which task to perform. Without bells and whistles, our
generalist Painter can achieve competitive performance compared to
well-established task-specific models, on seven representative vision tasks
ranging from high-level visual understanding to low-level image processing. In
addition, Painter significantly outperforms recent generalist models on several
challenging tasks.
Comment: Accepted to CVPR 2023. Code and model are available at:
https://github.com/baaivision/Painte
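The inference setup described in the abstract, where a prompt pair of input and output images conditions the prediction for a new input, can be sketched as canvas construction. The 2x2 layout, the nested-list image representation, and the function names below are assumptions for illustration; the actual model works on image patches, and its exact layout is not stated in the abstract.

```python
def stitch_pair(left, right):
    """Place two equally sized images side by side, row by row."""
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def build_canvas(prompt_in, prompt_out, query_in, mask_value=None):
    """Stitch a task prompt pair above a query whose output is masked.

    Top row:    prompt input | prompt output  (tells the model the task)
    Bottom row: query input  | masked region  (to be filled in by the model)
    """
    height, width = len(query_in), len(query_in[0])
    masked = [[mask_value] * width for _ in range(height)]
    return stitch_pair(prompt_in, prompt_out) + stitch_pair(query_in, masked)
```

A masked image model trained on such stitched canvases would then be asked to reconstruct only the masked quadrant, which plays the role of the task output for the query image.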
Domain Generalization for Activity Recognition via Adaptive Feature Fusion
Human activity recognition requires building a generalizable
model from the training datasets in the hope of achieving good performance on
test datasets. However, in real applications, the training and testing datasets
may have totally different distributions due to various reasons such as
different body shapes, acting styles, and habits, damaging the model's
generalization performance. While such a distribution gap can be reduced by
existing domain adaptation approaches, they typically assume that the test data
can be accessed in the training stage, which is not realistic. In this paper,
we consider a more practical and challenging scenario: domain-generalized
activity recognition (DGAR) where the test dataset cannot be accessed
during training. To this end, we propose Adaptive Feature Fusion for
Activity Recognition (AFFAR), a domain generalization approach that learns to
fuse the domain-invariant and domain-specific representations to improve the
model's generalization performance. AFFAR takes the best of both worlds where
domain-invariant representations enhance the transferability across domains and
domain-specific representations leverage the model discrimination power from
each domain. Extensive experiments on three public HAR datasets show its
effectiveness. Furthermore, we apply AFFAR to a real application, i.e., the
diagnosis of Children's Attention Deficit Hyperactivity Disorder (ADHD), which
also demonstrates the superiority of our approach.
Comment: Accepted by ACM Transactions on Intelligent Systems and Technology
(TIST) 2022; Code:
https://github.com/jindongwang/transferlearning/tree/master/code/DeepD
LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos
Analyzing the interactions between humans and objects from a video includes
identification of the relationships between humans and the objects present in
the video. It can be thought of as a specialized version of Visual Relationship
Detection, wherein one of the objects must be a human. While traditional
methods formulate the problem as inference on a sequence of video segments, we
present a hierarchical approach, LIGHTEN, to learn visual features to
effectively capture spatio-temporal cues at multiple granularities in a video.
Unlike current approaches, LIGHTEN avoids using ground truth data like depth
maps or 3D human pose, thus increasing generalization across non-RGBD datasets
as well. Furthermore, we achieve the same using only the visual features,
instead of the commonly used hand-crafted spatial features. We achieve
state-of-the-art results in human-object interaction detection (88.9% and
92.6%) and anticipation tasks of CAD-120 and competitive results on image based
HOI detection in V-COCO dataset, setting a new benchmark for visual features
based approaches. Code for LIGHTEN is available at
https://github.com/praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI
Comment: 9 pages, 6 figures, ACM Multimedia Conference 202
Diverse Knowledge Distillation for End-to-End Person Search
Person search aims to localize and identify a specific person from a gallery
of images. Recent methods can be categorized into two groups, i.e., two-step
and end-to-end approaches. The former views person search as two independent
tasks and achieves dominant results using separately trained person detection
and re-identification (Re-ID) models. The latter performs person search in an
end-to-end fashion. Although the end-to-end approaches yield higher inference
efficiency, they largely lag behind those two-step counterparts in terms of
accuracy. In this paper, we argue that the gap between the two kinds of methods
is mainly caused by the Re-ID sub-networks of end-to-end methods. To this end,
we propose a simple yet strong end-to-end network with diverse knowledge
distillation to break the bottleneck. We also design a spatial-invariant
augmentation to help the model be invariant to inaccurate detection results.
Experimental results on the CUHK-SYSU and PRW datasets demonstrate the
superiority of our method against existing approaches -- it achieves on par
accuracy with state-of-the-art two-step methods while maintaining high
efficiency due to the single joint model. Code is available at:
https://git.io/DKD-PersonSearch.
Comment: Accepted to AAAI, 2021.
Overviews of Investigation on Submersible Pressure Hulls
As the exploration of natural resources and oceanographic research in the deep sea have attracted growing attention in recent years, the pressure hulls of submersibles have been widely studied and used in many countries. To support their continued design and assessment, the paper summarizes the design methods, structural features, and material selection of pressure hulls